Welcome to the Data Science World

This course is divided into three parts:

  • Part 1: Getting Started with R for Data Science
    • Introduction to Data Science.
    • CRISP-DM
    • Data Scientist's Tools.
    • R Programming Language.
  • Part 2: Data Wrangling & Statistics
    • Reading Data from different Sources.
    • Handling Inconsistent and Missing Data & Anomalies
    • Data Normalization & Transformation
    • Dimension Reduction
    • Intro to Statistics

Welcome to the Data Science World (continued)

  • Part 3: Modeling and Evaluation
    • Supervised Learning
      • Linear Regression
      • Logistic Regression
      • Decision Trees
      • Random Forest
    • Unsupervised Learning
      • K-NN Clustering
      • K-Means Clustering
      • Hierarchical Clustering
    • Association Rules

Data Are Everywhere

Why Study Data Science?

Dell Sales Reps Productivity

Problem: Sales reps spend a lot of time doing online research and gathering information from disparate sources before they actually engage with a sales prospect.
Data Used: LinkedIn and other public sources available on the Web.
Solution: Predictive model called 'Lattice Engines'.
Benefits: Productivity increased by 100%.


Massachusetts Medicaid Fraudulent Claims

Problem: Many fraudulent Medicaid claims were costing the state of Massachusetts millions of dollars.
Data Used: Massachusetts' Medicaid Management Information System (MMIS), plus public and other Commonwealth data stored in a data warehouse.
Solution: Predictive model called 'NetReveal'.
Benefits: Recovered 2 million dollars in the first six months.


IBM & Epic Predictive Model Saves Lives

Problem: Carilion Clinic would have had to spend years going over its data to identify patients at risk of heart failure. By then it would have been too late.
Data Used: IBM's data warehouse and Epic electronic health records.
Solution: Watson predictive model and natural language processing.
Benefits: Identified 8,500 patients who are at risk of congestive heart failure within one year.


President Obama's Campaign

Problem: Obama campaign staff needed to focus on the people who were more likely to vote and make sure they voted.
Data Used: Web, social media, and data collected by campaign offices.
Solution: Complex micro-targeting model.
Benefits: The model predicted that Obama would receive 56.4% of the vote in Ohio; his actual share there was 56.6%. Obama went on to win the election.


What is Data Science?

Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. ~ Wikipedia

What are the types of Data?

  • Structured Data
    • Data that has a predefined format. This mainly refers to data that can be stored in tabular form.

What are the types of Data?

  • Unstructured Data
    • Everything else, from text on websites and social media to uploaded videos and music.

Structured VS Unstructured


Dealing with unstructured data is beyond the scope of this course, so we will focus on structured data.

Are all data equal?

The answer is: No.


Tasks of Data Scientist

Data scientists are mainly involved in five tasks in their daily work. They try to:

  • Describe the data.
  • Estimate or predict it.
  • Classify it.
  • Cluster it.
  • Identify associations within it.

We will explore these five tasks in detail in the next slides.

Tasks of Data Scientist

  • Description
    • This involves conducting exploratory analysis and computing descriptive statistics.
    • The data scientist only summarizes the data. No interpretation is done at this task level.

Example of Description

Here is a student data set that includes GPA and GRE scores, the rank of the school each student applied to, and whether they were admitted.

##  admit          gre             gpa        rank   
##  YES:127   Min.   :220.0   Min.   :2.260   1: 61  
##  NO :273   1st Qu.:520.0   1st Qu.:3.130   2:151  
##            Median :580.0   Median :3.395   3:121  
##            Mean   :587.7   Mean   :3.390   4: 67  
##            3rd Qu.:660.0   3rd Qu.:3.670          
##            Max.   :800.0   Max.   :4.000
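
A summary like the one above can be produced with R's built-in `summary()` function. The sketch below uses a small, made-up data frame; the name `students` and all of its values are assumptions for illustration, not the course data.

```r
# Hypothetical student records; values are invented for illustration only.
students <- data.frame(
  admit = factor(c("YES", "NO", "NO", "YES", "NO", "NO")),
  gre   = c(660, 520, 580, 800, 480, 620),
  gpa   = c(3.67, 3.13, 3.40, 4.00, 2.93, 3.61),
  rank  = factor(c(1, 3, 2, 1, 4, 2))
)

# summary() reports counts for factors, and the quartiles, minimum,
# maximum and mean for numeric columns.
print(summary(students))

# Individual descriptive statistics can also be computed directly.
mean_gre     <- mean(students$gre)
median_gpa   <- median(students$gpa)
admit_counts <- table(students$admit)
```

Note that nothing here interprets the data; each call only summarizes it.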

Tasks of Data Scientist

  • Estimation & Prediction
    • We use categorical and/or numerical predictors to predict or estimate a numerical target variable.
    • In this task we try to uncover trends and understand the relationship between the variables.
    • If we use this knowledge and extend it to new data sets, then we are trying to predict values.

  • Classification
    • We use categorical and/or numerical predictors to predict a categorical target variable.

  • Clustering
    • Grouping records, observations, or cases into classes of similar objects. No target variable is needed.

  • Association
    • Finding which attributes "go together." Most prevalent in the business world, where it is known as market basket analysis.
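
The four modeling tasks above can be sketched in base R as follows. This is a minimal illustration, not the course's own material: it assumes the built-in `mtcars` and `iris` data sets, and the `baskets` list is invented.

```r
# Estimation: predict a numerical target (fuel economy) from car weight.
fit_lm <- lm(mpg ~ wt, data = mtcars)
predicted_mpg <- predict(fit_lm, newdata = data.frame(wt = 3))

# Classification: predict a categorical target (transmission type, am)
# from weight using logistic regression.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
prob_manual <- predict(fit_glm, newdata = data.frame(wt = 2.5),
                       type = "response")
predicted_class <- ifelse(prob_manual > 0.5, "manual", "automatic")

# Clustering: group flowers by their measurements; no target variable is used.
set.seed(42)  # k-means starts from random centers
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
print(table(clusters$cluster))

# Association: how often do bread and butter "go together"?
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("bread", "eggs")
)
has_bread  <- sapply(baskets, function(b) "bread" %in% b)
has_butter <- sapply(baskets, function(b) "butter" %in% b)
support    <- mean(has_bread & has_butter)                  # share of baskets with both
confidence <- sum(has_bread & has_butter) / sum(has_bread)  # P(butter | bread)
```

The only difference between the estimation and classification calls is the type of the target: `mpg` is numeric, while `am` takes one of two categories.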

CRISP-DM

CRISP-DM: Cross-Industry Standard Process for Data Mining

CRISP-DM

  1. Business/Research Understanding Phase
      - First, clearly define the project objectives and requirements
        in terms of the business or research unit as a whole.
      - Then, translate these goals and restrictions into the 
        formulation of a data science problem definition.
      - Finally, prepare a preliminary strategy for achieving 
        these objectives

Business/Research Understanding Phase

Business Understanding Example

  • An oil and gas company is seeking to reduce operational fatalities.
    • Classification Problem.
  • A retail store wants to identify its customers' market segments.
    • Clustering Problem.

CRISP-DM

  2. Data Understanding Phase
      - First, collect the data
      - Then, use exploratory data analysis to familiarize yourself
        with the data, and discover initial insights.
      - Evaluate the quality of the data.
      - Finally, if desired, select interesting subsets that 
        may contain actionable patterns.
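
A first pass at these steps might look like the sketch below. It assumes the built-in `airquality` data set as a stand-in for freshly collected data; the 90-degree threshold is an arbitrary choice for illustration.

```r
# airquality ships with R; it stands in for freshly collected data.
data(airquality)

# Structure: number of rows and columns, and the type of each variable.
str(airquality)

# Initial insights: distribution of each variable.
print(summary(airquality))

# Data quality: count missing values per column.
missing_per_column <- colSums(is.na(airquality))
print(missing_per_column)

# Select a subset of interest: hot days (above 90 degrees Fahrenheit).
hot_days <- subset(airquality, Temp > 90)
print(nrow(hot_days))
```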

Data Understanding Details.

In order to understand data science, it is important to understand the nature of databases, data collection and data organization.

We need to understand the differences between databases, data warehouses, and data sets.

What is common among them is that they all have rows and columns.

Data Understanding Details.

A database is an organized grouping of information within a specific structure.

The tables are related by the single column they have in common: Owner_ID. By relating tables to one another, we can reduce redundancy of data and improve database performance. The process of breaking tables apart and thereby reducing data redundancy is called normalization
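
As a rough illustration of normalization, the sketch below splits one redundant table into two tables linked by `Owner_ID`. The table contents and the other column names are invented for this example.

```r
# A single redundant table: owner details repeat for every pet.
# All names and values here are invented for illustration.
pets_wide <- data.frame(
  Pet_Name   = c("Rex", "Tom", "Bella"),
  Owner_ID   = c(1, 1, 2),
  Owner_Name = c("Alice", "Alice", "Bob"),
  Owner_City = c("Boston", "Boston", "Denver")
)

# Normalize: keep each owner once, and let pets refer to owners by Owner_ID.
owners <- unique(pets_wide[, c("Owner_ID", "Owner_Name", "Owner_City")])
pets   <- pets_wide[, c("Pet_Name", "Owner_ID")]

# The original information can be recovered by joining on the shared column.
rejoined <- merge(pets, owners, by = "Owner_ID")
print(rejoined)
```

After the split, an owner's city is stored in exactly one row, so a change of address needs only one update.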

Data Understanding Details.

Most relational databases which are designed to handle a high number of reads and writes (updates and retrievals of information) are referred to as OLTP (online transaction processing) systems

Analyzing OLTP data directly is not practical. We need to query the tables to get meaningful data. Queries are usually written in a language called SQL (Structured Query Language; pronounced 'sequel').

Querying an OLTP system is usually very intensive and time consuming, even on the most robust computers. That is why we use a data warehouse.

Data Understanding - Practical Example

Northwind Database

Data Understanding Details.

A data warehouse is a type of large database that has been denormalized and archived. Denormalization is the process of intentionally combining some tables into a single table in spite of the fact that this may introduce duplicate data in some columns (or in other words, attributes). Systems that perform this process are called OLAP (online analytical processing).
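
Denormalization can be sketched in R with `merge()`. The `customers` and `orders` tables below are hypothetical; the point is only to show the duplicated values that appear when tables are combined.

```r
# Two normalized tables (hypothetical values).
customers <- data.frame(
  Customer_ID = c(1, 2),
  Name        = c("Alice", "Bob")
)
orders <- data.frame(
  Order_ID    = c(101, 102, 103),
  Customer_ID = c(1, 1, 2),
  Amount      = c(25, 40, 15)
)

# Denormalize: one wide table with customer attributes copied onto every order.
orders_wide <- merge(orders, customers, by = "Customer_ID")

# The duplication is visible: "Alice" now appears once per order.
print(table(orders_wide$Name))

# Analytical questions can be answered from the single table without more joins.
total_by_customer <- aggregate(Amount ~ Name, data = orders_wide, FUN = sum)
print(total_by_customer)
```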

Data Understanding Details.

Because data warehouses are huge, we usually don't work on all of the data at once. We pick a subset from this large table. Those subsets are called data sets.

A subset of data that is designed for a specific business function is called a data mart.
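
In R terms, both ideas amount to taking a subset of a larger table. The `warehouse` data frame below is invented, and treating a filtered data frame as a data mart is a simplification for illustration.

```r
# A stand-in for a large warehouse table (values are invented).
warehouse <- data.frame(
  Region     = c("East", "West", "East", "North", "West"),
  Department = c("Sales", "Sales", "HR", "Sales", "HR"),
  Year       = c(2019, 2019, 2020, 2020, 2020),
  Revenue    = c(120, 95, 0, 80, 0)
)

# A data set: a subset picked for one analysis (here, year 2020 only).
data_set_2020 <- subset(warehouse, Year == 2020)

# A data mart: the slice serving one business function (here, Sales),
# keeping only the columns that function cares about.
sales_mart <- subset(warehouse, Department == "Sales",
                     select = c(Region, Year, Revenue))
```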

Data Understanding Details - Summary

CRISP-DM

  3. Data Preparation Phase
      - This labor-intensive phase covers all aspects of preparing
        the final data set, which shall be used for subsequent phases,
        from the initial, raw, dirty data.
      - Select the cases and variables you want to analyze,
        and that are appropriate for your analysis.
      - Perform transformations on certain variables, if needed.
      - Clean the raw data so that it is ready for the modeling tools

CRISP-DM

  4. Modeling Phase
      - Select and apply appropriate modeling techniques.
      - Calibrate model settings to optimize results.
      - Often, several different techniques may be applied for
        the same data mining problem.
      - May require looping back to data preparation phase,
        in order to bring the form of the data into line with
        the specific requirements of a particular data mining technique.
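
As a small illustration of applying several techniques to the same problem, the sketch below fits three regression variants to the built-in `mtcars` data and compares them. The choice of models and of adjusted R-squared as the comparison metric are assumptions made for this example.

```r
# Technique 1: simple linear regression.
fit_linear <- lm(mpg ~ wt, data = mtcars)

# Technique 2: polynomial regression (adds a squared weight term).
# The polynomial degree is a model setting that can be calibrated.
fit_poly <- lm(mpg ~ poly(wt, 2), data = mtcars)

# Technique 3: multiple regression with an extra predictor.
fit_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Compare the fits with adjusted R-squared (higher is better on this metric).
adj_r2 <- c(
  linear = summary(fit_linear)$adj.r.squared,
  poly   = summary(fit_poly)$adj.r.squared,
  multi  = summary(fit_multi)$adj.r.squared
)
print(round(adj_r2, 3))
```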

CRISP-DM

  5. Evaluation Phase
      - The modeling phase has delivered one or more models. These models
        must be evaluated for quality and effectiveness, before we deploy
        them for use in the field.
      - Also, determine whether the model in fact achieves the objectives
        set for it in phase 1.
      - Establish whether some important facet of the business or
        research problem has not been sufficiently accounted for.
      - Finally, come to a decision regarding the use of the
        data mining results
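
One common way to evaluate a model before deployment is to hold out part of the data and measure the error on it. The sketch below assumes the built-in `mtcars` data, a 70/30 split, and RMSE as the quality measure; all three are illustrative choices.

```r
set.seed(1)  # make the random split reproducible

# Hold out roughly 30% of the rows for testing.
n         <- nrow(mtcars)
test_idx  <- sample(n, size = round(0.3 * n))
train_set <- mtcars[-test_idx, ]
test_set  <- mtcars[test_idx, ]

# Fit on the training rows only.
fit <- lm(mpg ~ wt, data = train_set)

# Evaluate on unseen rows with root mean squared error (RMSE).
pred <- predict(fit, newdata = test_set)
rmse <- sqrt(mean((test_set$mpg - pred)^2))

# Compare against a naive baseline that always predicts the training mean.
baseline_rmse <- sqrt(mean((test_set$mpg - mean(train_set$mpg))^2))

print(c(model = rmse, baseline = baseline_rmse))
```

A model that cannot beat the naive baseline on held-out data is unlikely to meet the objectives set in phase 1.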

CRISP-DM

  6. Deployment Phase
      - Model creation does not signify the completion of the project.
        We need to make use of the created models.
      - Example of a simple deployment: Generate an automated report.
      - Example of a more complex deployment: 
        Implement a parallel data mining process in another department.
      - Finally, come to a decision regarding the use of the
        data mining results
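
A very simple deployment could look like the sketch below: the model is saved, reloaded later, and used to score new records that are written to a report file. The file locations use `tempfile()` and the `new_cars` values are hypothetical.

```r
# Train a model (stands in for the model chosen during evaluation).
final_model <- lm(mpg ~ wt, data = mtcars)

# Persist it so another script or department can reuse it.
model_path <- tempfile(fileext = ".rds")
saveRDS(final_model, model_path)

# Later: load the model and score new records.
loaded_model <- readRDS(model_path)
new_cars <- data.frame(wt = c(2.5, 3.5))  # hypothetical new data
new_cars$predicted_mpg <- predict(loaded_model, newdata = new_cars)

# A very simple "report": write the scored records to a CSV file.
report_path <- tempfile(fileext = ".csv")
write.csv(new_cars, report_path, row.names = FALSE)
```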

CRISP-DM in the real world

Homework